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Abstract. In this paper we propose a novel algorithm, factored value itera- 
tion (F VI) , for the approximate solution of factored Markov decision processes 
(fMDPs) . The traditional approximate value iteration algorithm is modified in 
two ways. For one, the least-squares projection operator is modified so that it 
docs not increase max-norm, and thus preserves convergence. The other mod- 
ification is that we uniformly sample polynomially many samples from the 
(exponentially large) state space. This way, the complexity of our algorithm 
becomes polynomial in the size of the fMDP description length. We prove that 
the algorithm is convergent. We also derive an upper bound on the difference 
between our approximate solution and the optimal one, and also on the error 
introduced by sampling. We analyze various projection operators with respect 
to their computation complexity and their convergence when combined with 
approximate value iteration. 

factored Markov decision process, value iteration, reinforcement learning 

1. Introduction 

Markov decision processes (MDPs) are extremely useful for formalizing and solv- 
ing sequential decision problems, with a wide repertoire of algorithms to choose 
from [4, 26]. Unfortunately, MDPs are subject to the 'curse of dimensionality' [3]: 
for a problem with m state variables, the size of the MDP grows exponentially 
with m, even though many practical problems have polynomial-size descriptions. 
Factored MDPs (fMDPs) may rescue us from this explosion, because they offer a 
more compact representation [17, 5, 6]. In the fMDP framework, one assumes that 
dependencies can be factored to several easy-to-handle components. 

For MDPs with known parameters, there are three basic solution methods (and, 
naturally, countless variants of them): value iteration, policy iteration and linear 
programming (see the books of Sutton & Barto [26] or Bertsekas & Tsitsiklis [4] 
for an excellent overview). Out of these methods, linear programming is generally 
considered less effective than the others. So, it comes as a surprise that all effective 
fMDPs algorithms, to our best knowledge, use linear programming in one way or 
another. Furthermore, the classic value iteration algorithm is known to be divergent 
when function approximation is used [2, 27], which includes the case of fMDPs, too. 

In this paper we propose a variant of the approximate value iteration algorithm 
for solving fMDPs. The algorithm is a direct extension of the traditional value iter- 
ation algorithm. Furthermore, it avoids computationally expensive manipulations 
like linear programming or the construction of decision trees. We prove that the 
algorithm always converges to a fixed point, and that it requires polynomial time 
to reach a fixed accuracy. A bound to the distance from the optimal solution is 
also given. 

In Section 2 we review the basic concepts of Markov decision processes, includ- 
ing the classical value iteration algorithm and its combination with linear function 



2 



ISTVAN SZITA AND ANDRAS LORINCZ 



approximation. We also give a sufficient condition for the convergence of approxi- 
mate value iteration, and list several examples of interest. In Section 3 we extend 
the results of the previous section to fMDPs and review related works in Section 4. 
Conclusions are drawn in Section 5. 



2. Approximate Value Iteration in Markov Decision Processes 

2.1. Markov Decision Processes. An MDP is characterized by a sixtuple (X, A, R, P, 3 

where X is a finite set of states;-'" A is a finite set of possible actions; i? ; X x ^4 ^ M 
is the reward function of the agent, so that i?(x, a) is the reward of the agent after 
choosing action a in state x; P : X x A x X ^ [0, 1] is the transition function so 
that P(y I X, a) is the probability that the agent arrives at state y, given that she 
started from x upon executing action a; x^ G X is the starting state of the agent; 
and finally, 76 [0, 1) is the discount rate on future rewards. 

A policy of the agent is a mapping n : X x ^4 [0,1] so that 7r(x,a) tells 
the probability that the agent chooses action a in state x. For any Xq G X, the 
policy of the agent and the parameters of the MDP determine a stochastic process 
experienced by the agent through the instantiation 

xo,ao,ro,xi,ai,ri,. . . ,xt,at,rt, . . . 

The goal is to find a policy that maximizes the expected value of the discounted 
total reward. Let the value function of policy tt be 



00 

T/-(x) :=i?(^7*n I x=Xo) 
t=o 



and let the optimal value function be 

F*(x) := maxy'^(x) 

TT 

for each x G X. If V* is known, it is easy to find an optimal policy tt*, for which 
y-n = Y*_ Provided that history does not modify transition probability distri- 
bution P(y|x, a) at any time instant, value functions satisfy the famous Bellman 
equations 

(1) y-(x) = ^ ^ 7r(x, a)P(y | x, a) (p(x, a) + jV^y)) 

a y 

and 

(2) V* (x) = max ^ P(y I x, a) (p(x, a) + ^V* (y)) . 

y 

Most algorithms that solve MDPs build upon some version of the Bellman equa- 
tions. In the following, we shall concentrate on the value iteration algorithm. 



Later on, wc shall generalize the concept of the state of the system. A state of the system 
will be a vector of state variables in our fMDP description. For that reason, we already use the 
boldface vector notation in this preliminary description. 
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2.2. Exact Value Iteration. Consider an MDP (X, A, P, i?, x^, 7). The value 
iteration for MDPs uses the Bellman equations (2) as an iterative assignment: It 
starts with an arbitrary value function Vq : X ^ R, and in iteration t it performs 
the update 

(3) Vt+i{^) := max ^ P(y | x, a) (i?(x, a) + jVt{y)) 

yex 

for all X € X. For the sake of better readability, we shall introduce vector no- 
tation. Let N := |X|, and suppose that states are integers from 1 to N, i.e. 
X = {1,2, .. . ,N}. Clearly, value functions are equivalent to A''-dimensional vec- 
tors of reals, which may be indexed with states. The vector corresponding to V 
will be denoted as v and the value of state x by Vx- Similarly, for each a let us 
define the A'^-dimensional column vector r" with entries r° = i?(x, a) and N x N 
matrix with entries P^y = P(y | x, a). With these notations, (3) can be written 
compactly as 

(4) Vf+i := max„e^ (r" + 7PVt) . 

Here, max denotes the componentwise maximum operator. 

It is also convenient to introduce the Bellman operator T : — > that maps 
value functions to value functions as 

Tv := maXaeA (r" + 7^"v) . 

As it is well known, T is a max-norm contraction with contraction factor 7: for 
any v, u G K^, ||Tv — Tu||oo < 7||v — u||oo- Consequently, by Banach's fixed point 
theorem, exact value iteration (which can be expressed compactly as Vt+i := Tvt) 
converges to an unique solution v* from any initial vector Vq, and the solution v* 
satisfies the Bellman equations (2). Furthermore, for any required precision e > 0, 
||vt — v*||(x) < e if i > i^^||vo — v*||(x). One iteration costs 0{N^- \A\) computation 
steps. 

2.3. Approximate value iteration. In this section we shall review approximate 
value iteration (AVI) with linear function approximation (LFA) in ordinary MDPs. 
The results of this section hold for AVI in general, but if we can perform all oper- 
ations effectively on compact representations (i.e. execution time is polynomially 
bounded in the number of variables instead of the number of states), then the 
method can be directly applied to the domain of factorized Markovian decision 
problems, underlining the importance of our following considerations. 

Suppose that we wish to express the value functions as the linear combination 
of K basis functions /i^ : X ^ IR (fc e {1, . . . , K}), where K « N. Let H be the 
N X K matrix with entries H^.k — ^fe(x). Let wj e M^^ denote the weight vector 
of the basis functions at step t. We can substitute = Hwt into the right hand 
side (r.h.s.) of (4), but we cannot do the same on the left hand side (l.h.s.) of the 
assignment: in general, the r.h.s. is not contained in the image space of H, so there 
is no such Wt_|_i that 

Hwt+i = max„e^(r'^ + jP'^Hwt). 

We can put the iteration into work by projecting the right-hand side into w-space: 
let a : R^ ^ be a (possibly non-linear) mapping, and consider the iteration 

(5) wt+i := g [maoCaeA (r" + jP^Hwt)] 
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with an arbitrary starting vector Wq. 

Lemma 2.1. If Q is such that HQ is a non- expansion, i.e., for any v,v' e , 

||i/gv-i/gv'|U < ||v-v'||oo, 
then there exists a w* € M.^ such that 

w* = g[maxaeA(r" + 7^"^w*)] 
and iteration (5) converges to w* from any starting point. 

Proof. We can write (5) compactly as Wj+i = QTHwt. Let Vt = Hwt. This 
satisfies 

(6) vt+i = HQTvt. 

It is easy to see that the operator HQT is a contraction: for any v, v' e M.^ , 
WHgTv-HGTv'W^ < ||Tv-Tv'|U <7l|v-v'|U 

by the assumption of the lemma and the contractivity of T. Therefore, by Banach's 

fixed point theorem, there exists a vector v* G such that v* = HQTv* and 
iteration (6) converges to v* from any starting point. It is easy to see that w* = 
QTv* satisfies the statement of the lemma. 

□ 

Note that if 5 is a linear mapping with matrix G € R^^-'^, then the assumption 
of the lemma is equivalent to ||i?G||oo < 1- 

2.4. Examples of Projections, Convergent and Divergent. In this section, 
we examine certain possibilities for choosing projection Q. Let v G be an 
arbitrary vector, and let w = Qv be its ^-projection. For linear operators, Q can 
be represented in matrix form and wc shall denote it by G. 

Least-squares (L2-)projection. Least-squares fitting is used almost exclu- 
sively for projecting value functions, and the term AVI is usually used in the sense 
"AVI with least-squares projection" . In this case, w is chosen so that it minimizes 
the least-squares error: 

w := argmin \\Hw — v||2. 

w 

This corresponds to the linear projection G2 = (i.e., w = H~^v), where is 
the Moore-Penrose pseudoinverse of H. It is well known, however, that this method 
can diverge. For an example on such divergence, see, e.g. the book of Bertsekas & 
Tsitsiklis [4]. The reason is simple: matrix HH~^ is a non-expansion in L2-iiorm, 
but Lemma 1 requires that it should be an ico-uorm projection, which does not 
hold in the general case. (See also Appendix A.l for illustration.) 

Constrained least-squcires projection. One can enforce the non-expansion 
property by expressing it as a constraint: Let w be the solution of the constrained 
minimization problem 

w := argmin \\Hw - v\\l, subject to ||i?w||oo < ||v||<x,, 

w 

which defines a non- linear mapping Q2. This projection is computationally highly 
demanding: in each step of the iteration, one has to solve a quadratic programming 
problem. 
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Max-norm (Loo-)projection. Similarly to L2-piojection, we can also select w 
so that it minimizes the max-norm of the residual: 

w := argmin \\Hw — v\\oo- 

w 

The computation of w can be transcribed into a linear programming task and that 
defines the non-linear mapping Q^. However, in general, ||-ff^oov||c!o ^ ||v||cx)) and 
consequently AVI using iteration 

Wt+i := argmin \\Hw - THwt\\c^ 

w 

can be divergent. Similarly to L2 projection, one can also introduce a constrained 
version defined by 

^f_v := argmin ||iJw-v||oo, subject to ||-H"w||oo < \W\\oo, 

w 

which can also be turned into a linear program. 

It is insightful to contrast this with the approximate linear programming method 
of Guestrin et al. [13]: they directly minimize the max-norm of the Bellman error, 
i.e., they solve the problem 

\Hw-THw\\ 

which can be solved without constraints. 

Z/i-norm projection. Let Qli be defined by 

QiV := argmin \\Hw — v||i. 

w 

The ii-norm projection also requires the solution of a linear program, but inter- 
estingly, the projection operator Qi is a non-expansion (the proof can be found in 
Appendix A.l). 

AVI-compatible operators considered so far {Q2, Q% and Q\) were non-linear, and 
required the solution of a linear program or a quadratic program in each step of 
value iteration, which is clearly cumbersome. On the other hand, while G^y = i?+v 
is linear, it is also known to be incompatible with AVI [2, 27]. Now, we shall focus 
on operators that arc both AVI-compatible and linear. 

Normalized linear mapping. Let G be an arbitrary KxN matrix, and define 
its normalization M{G) as a matrix with the same dimensions and entries 

^i,3 



(E,'|^?i,,'|)(Ei'|G.',il) 



that is, N{G) is obtained from G by dividing each clement with the corresponding 
row sum of H and the corresponding column sum of G. All (absolute) row sums 
of H ■ Af{G) are equal to 1. Therefore, (i) \\H ■ Af{G)\\oo = 1, and (ii) H ■ N{G) is 
maximal in the sense that if the absolute value of any element of M(G) increased, 
then for the resulting matrix G', \\H ■ G"||oc > 1- 

Probabilistic linear mapping. If all elements of H are nonnegative and all 
the row-sums of H are equal, then M{H'^) assumes a probabilistic interpretation. 
This interpretation is detailed in Appendix A. 2. 

Normalized least-squares projection. Among all linear operators, is 
the one that guarantees the best least- squares error, therefore we may expect that 
its normalization, Af(H~^) plays a similar role among AVI-compatible linear projec- 
tions. Unless noted otherwise, we will use the projection Af{H~^) subsequently. 
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2.5. Convergence properties. 

Lemma 2.2. Let v* be the optimal value function and w* be the fixed point of the 
approximate value iteration (5). Then 

||Fw* - v*|U < -^WHGw* - v*||oo. 

1-7 

Proof. For the optimal value function, v* = Tv* holds. On the other hand, w* = 
GTHw*. Thus, 

ll^fw* -v*||oo = \\HgTHw* -Tv*\\^ 

< WHGTHw* - HgTv*\\oo + WHQTv" - Tv*||<x> 

< \\THw* ~Tv*\\^ + \\Hgv* 

< 7||i7w* - v*||,3c. + ll-ff^v* - v*||3c. 

from which the statement of the lemma follows. For the transformations we have 
applied the triangle inequality, the non-expansion property of HQ and the contrac- 
tion property of T. □ 

According to the lemma, the error bound is proportional to the projection error 
of V*. Therefore, if v* can be represented in the space of basis functions with small 
error, then our AVI algorithm gets close to the optimum. Furthermore, the lemma 
can be used to check a posteriori how good our basis functions are. One may 
improve the set of basis functions iteratively. Similar arguments have been brought 
up by Guestrin et al. [13], in association with their LP-based solution algorithm. 

3. Factored value iteration 

MDPs are attractive because solution time is polynomial in the number of states. 
Consider, however, a sequential decision problem with m variables. In general, we 
need an exponentially large state space to model it as an MDP. So. the number 
of states is exponential in the size of the description of the task. Factored Markov 
decision processes may avoid this trap because of their more compact task repre- 
sentation. 

3.1. Factored Markov decision processes. We assume that X is the Cartesian 
product of m smaller state spaces (corresponding to individual variables): 

X = Xi X A2 X . . . X A„j . 

For the sake of notational convenience we will assume that each Xj has the same 
size, \Xi\ = IX2I = . . . = \Xm\ = n. With this notation, the size of the full state 
space is iV = |X| = n™. We note that all derivations and proofs carry through to 
different size variable spaces. 

A naive, tabular representation of the transition probabilities would require expo- 
nentially large space (that is, exponential in the number of variables m). However, 
the next-step value of a state variable often depends only on a few other variables, 
so the full transition probability can be obtained as the product of several simpler 
factors. For a formal description, we introduce several notations: 

For any subset of variable indices Z C {1, 2, . . . , m}, let X[Z] := x Xi. further- 

iez 

more, for any x G X, let x[Z] denote the value of the variables with indices in Z. 
We shall also use the notation x[Z] without specifying a full vector of values x, in 
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such cases x[Z] denotes an element in X[Z]. For single-element sets Z = {«} we 
shall also use the shorthand x[{i}] = x[i]. 

A function / is a local-scope function if it is defined over a subspace X[2'] of the 
state space, where Z is a. (presumably small) index set. The local-scope function 
/ can be extended trivially to the whole state space by /(x) := /(x[Z]). If \Z\ 
is small, local-scope functions can be represented efficiently, as they can take only 
n'^l different values. 

Suppose that for each variable i there exist neighborhood sets Fj such that the 
value of Xt+i[i] depends only on Xt[rj] and the action at taken. Then we can write 
the transition probabilities in a factored form 

n 

(7) P(y |x,a) = I]Pi(y[i] |x[r,],a) 

for each x,y G X., a G A, where each factor is a local-scope function 

(8) Pi : XfFi] xAxXi^[0, 1] (for alH e {1, . . . , m}). 

We will also suppose that the reward function is the sum of J local-scope functions: 

J 

(9) R{x,a) = Y,RMZj],a), 

with arbitrary (but preferably small) index sets Zj, and local-scope functions 

(10) Rj :X[Zj]x A^R (for all j e {1, . . . , J}). 

To sum up, a factored Markov decision process is characterized by the parameters 
({Xi : 1 < i < m};^;{i?, : 1 < .? < J}; {F^ : 1 < i < n};{Pi : 1 < i < n};Xs;i), 
where Xg denotes the initial state. 

Functions Pi and Ri are usually represented either as tables or dynamic Bayesian 
networks. If the maximum size of the appearing local scopes is bounded by some 
constant, then the description length of an fMDP is polynomial in the number of 
variables n. 

3.1.1. Value functions. The optimal value function is an A'' = n"*-dimensional vec- 
tor. To represent it efficiently, we should rewrite it as the sum of local-scope 
functions with small domains. Unfortunately, in the general case, no such factored 
form exists [13]. 

However, we can still approximate V* with such an expression: let K be the 
desired number of basis functions and for each k e {1, . . . ,K}, let Ck be the 
domain set of the local-scope basis function hk ■ X[Cft] — > R. We are looking for a 
value function of the form 

K 

(11) V{:>c) = ^Wk-hk{x[Ck]). 

fe=i 

The quality of the approximation depends on two factors: the choice of the basis 
functions and the approximation algorithm. Basis functions are usually selected 
by the experiment designer, and there are no general guidelines how to automate 
this process. For given basis functions, we can apply a number of algorithms to 
determine the weights Wk- We give a short overview of these methods in Section 4. 
Here, we concentrate on value iteration. 
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3.2. Exploiting factored structure in value iteration. For fAIDPs, wc can 
substitute the factored form of the transition probabilities (7), rewards (9) and the 
factored approximation of the value function (11) into the AVI formula (5), which 
yields 



'^hh{x[Ch]) ■ Wk,t+i ~ max 

fe=l " yeX i=l 

J K 

(^i?,(x[Z,],a) + 7 ^ hk'{y[Ck']) ■ Wk',t)- 



j=i k'=i 

By rearranging operations and exploiting that all occurring functions have a local 
scope, we get 



K 



E/ifc(x[Cfe]) • Wk,t+i = ^femax 
a 



fe=l 

K 



J 



.i=i 



(12) +7 5^ {XI Pi{y[^\A^ila))hMCk'])-Wk',t 

fc'=iy[Cfc']ex[Cfc/] idCy 

for all X e X. We can write this update rule more compactly in vector notation. 
Let 

Wt := (wi,(, M;2,t, . . . , WK,t) e K^, 

and let H be an |X| x K matrix containing the values of the basis functions. We 
index the rows of matrix H by the elements of X: 

-ffx,fc := /ife(x[Cfe]). 

Further, for each a& A, let be the |X| x ii' value backprojection matrix defined 
as 

■■= E ( n ^^(yW I x[r,],a))ft,(y[Cfc]) 

y[Cfe]ex[Cfc] ieCfc 

and for each a, define the reward vector r" S M'-^I by 

r«:=^i?,(x[Z,],a). 

Using these notations, (12) can be rewritten as 

(13) wt+i := QmaXaeA (r" + 7-B"wt) . 

Now, all entries of B", H and r" are composed of local-scope functions, so 

any of their individual elements can be computed efficiently. This means that the 
time required for the computation is exponential in the sizes of function scopes, 
but only polynomial in the number of variables, making the approach attractive. 
Unfortunately, the matrices arc still exponentially large, as there arc exponentially 
many equations in (12). One can overcome this problem by sampling as we show 
below. 
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3.3. Sampling. To circumvent tiie problem of having exponentially many equa- 
tions, we select a random subset X C X of the original state space so that 
|X| — poly(m), consequently, solution time will scale polynomially with m. On 
the other hand, we will select a sufficiently large subset so that the remaining sys- 
tem of equations is still over-determined. The necessary size of the selected subset 
is to be determined later: it should be as small as possible, but the solution of 
the reduced equation system should remain close to the original solution with high 
probability. For the sake of simplicity, we assume that the projection operator Q is 
linear with matrix G. Let the sub-matrices of G, H, B"^ and r" corresponding to 
X be denoted by G, H, i?" and r°, respectively. Then the following value update 

(14) wt+i G ■ maxaeA (? " + 7S"wt) 

can be performed effectively, because these matrices have polynomial size. Now we 
show that the solution from sampled data is close to the true solution with high 
probability. 

Theorem 3.1. Let w* be the unique solution of w* = GTHw* of an FMDP, 

and let w' be the solution of the corresponding equation with sampled matrices, 

w' = GTHw'. Suppose that the projection matrix G has a factored structure, 

too. Then iteration (14) converges to w', furthermore, for a suitable constant S 

(depending polynomially on , where z is the maximum cluster size), and for any 

e,5 > Q, ||w* — w'lloo < e holds with probability at least 1 — 5, if the sample size 

. „ »T , m 

satisfies Ni > ^^r- log — . 



The proof of Theorem 3.1 can be found in Appendix A. 3. The derivation is 
closely related to the work of Drineas and colleagues [11, 12], although we use the 
infinity-norm instead of the L2-iiorm. A more important difference is that we can 
exploit the factored structure, gaining an exponentially better bound. The resulting 
factored value iteration algorithm is summarized in Table 1. 



Algorithm 1 Factored value iteration with a linear projection matrix G. 
% inputs: 

% factored MDP, M - ({XJ^i; A; {Fjr^i; {PJ™ i; x,; 7) 

% basis functions, {hk}f^i 
% required accuracy, e 
A^o := number of samples 

X := uniform random A^o-element subset of X 
create H and^G^ 

create = P"H and r" for each a G A 

Wo = 0, t := 

repeat 

Wt+i := G ■ maxf r" + jB'^-Wt) 

At := ||wt+i - wtlloo 

t:=t+l 
until A( > e 
return Wt 
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4. Related work 

The exact solution of factored MDPs is infeasible. The idea of representing a 
large MDP using a factored model was first proposed by Kollcr & Parr [17] but 
similar ideas appear already in the works of Boutilier, Dearden, & Goldszmidt 
[5, 6]. More recently, the framework (and some of the algorithms) was extended to 
fMDPs with hybrid continuous-discircte variables [18] and factored partially observ- 
able MDPs [23]. Furthermore, the framework has also been applied to structured 
MDPs with alternative representations, e.g., relational MDPs [15] and first-order 
MDPs [24]. 

4.1. Algorithms for solving factored MDPs. There are two major branches 
of algorithms for solving fMDPs: the first one approximates the value functions as 
decision trees, the other one makes use of linear programming. 

Decision trees (or equivalently, decision lists) provide a way to represent the 
agent's policy compactly. KoUer & Parr [17] and Boutilier et al. [5, 6] present 
algorithms to evaluate and improve such policies, according to the policy iteration 
scheme. Unfortunately, the size of the policies may grow exponentially even with a 
decision tree representation [6, 20]. 

The exact Bellman equations (2) can be transformed to an equivalent linear 
program with TV variables {V^(x) : x G X} and N ■ \A\ constraints: 

maximize: a(x)y(x) 
xex 

subject to V{x) < R{x, o) + 7 ^ P(x' [ x, a)V{x'), (Vx e X, a e A). 

x'ex 

Here, weights aix.) are free parameters and can be chosen freely in the following 
sense: the optimum solution is V* independent of their choice, provided that each 
of them is greater than 0. In the approximate linear programming approach, we 
approximate the value function as a linear combination of basis functions (11), 
resulting in an approximate LP with K variables {wk : 1 < /c < K} and N ■ \A\ 
constraints: 

K 

(IS^aximize: Wk ■ a(x)/i/c(x[Cfc]) 

fc=i xex 

K 

subject to '^Wk- hk{-x.[Ck]) < 

K 

< i?(x,a) + 7 ^ Wk' ^ P(x' I x,a) • hk'{x'[Ck']). 
k'=i x'ex 

Both the objective function and the constraints can be written in compact forms, 

exploiting the local-scope property of the appearing functions. 

Markov decision processes were first formulated as LP tasks by Schweitzer and 
Seidmann [25]. The approximate LP form is due to de Farias and van Roy [7]. 
Guestrin et al. [13] show that the maximum of local-scope functions can be com- 
puted by rephrasing the task as a non-serial dynamic programming task and elim- 
inating variables one by one. Therefore, (15) can be transformed to an equivalent. 



FACTORED VALUE ITERATION CONVERGES 



11 



more compact linear program. The gain may be exponential, but this is not nec- 
essarily so in all cases: according to Guestrin et al. [13], "as shown by Dechter 
[9], [the cost of the transformation] is exponential in the induced width of the cost 
network, the undirected graph defined over the variables Xi; . . . ; X„, with an edge 
between Xi and Xm if they appear together in one of the original functions fj . The 
complexity of this algorithm is, of course, dependent on the variable elimination 
order and the problem structure. Computing the optimal elimination order is an 
NP-hard problem [1] and elimination orders yielding low induced tree width do 
not exist for some problems." Furthermore, for the approximate LP task (15), the 
solution is no longer independent of a and the optimal choice of the a values is not 
known. 

The approximate LP-based solution algorithm is also due to Guestrin et al. [13]. 
Dolgov and Durfee [10] apply a primal-dual approximation technique to the linear 
program, and report improved results on several problems. 

The approximate policy iteration algorithm [17, 13] also uses an approximate 
LP reformulation, but it is based on the poUcy-evaluation Bellman equation (1). 
Policy-evaluation equations are, however, linear and do not contain the maximum 
operator, so there is no need for the second, costly transformation step. On the 
other hand, the algorithm needs an explicit decision tree representation of the 
policy. Liberatore [20] has shown that the size of the decision tree representation 
can grow exponentially. 

4.1.1. Applications. Applications of fMDP algorithms are mostly restricted to ar- 
tificial test problems like the problem set of Boutilier et al. [6], various versions of 
the Sys Admin task [13, 10, 21] or the New York driving task [23]. 

Guestrin, KoUer, Gearhart and Kanodia [15] show that their LP-based solution 
algorithm is also capable of solving more practical tasks: they consider the real- 
time strategy game FreeCraft. Several scenarios are modelled as fMDPs, and solved 
successfully. Furthermore, they find that the solution generalizes to larger tasks 
with similar structure. 

4.1.2. Unknown environment. The algorithms discussed so far (including our FVI 
algorithm) assume that all parameters of the fMDP are known, and the basis func- 
tions are given. In the case when only the factorization structure of the fMDP is 
known but the actual transition probabilities and rewards are not, one can apply 
the factored versions of [16] or R-max [14]. 

Few attempts exist that try to obtain basis functions or the structure of the 
fl\/IDP automatically. Patrascu et al. [21] select basis functions greedily so that 
the approximated Bellman error of the solution is minimized. Poupart et al. [22] 
apply greedy selection, too, but their selection criteria are different: a decision tree 
is constructed to partition the state space into several regions, and basis functions 
are added for each region. The approximate value function is piecewise linear in 
each region. The metric they use for splitting is related to the quality of the LP 
solution. 

4.2. Sampling. Sampling techniques are widely used when the state space is im- 
mensely large. Lagoudakis and Parr [19] use sampling without a theoretical analysis 
of performance, but the validity of the approach is verified empirically. De Farias 
and van Roy [8] give a thorough overview on constraint sampling techniques used 
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for the linear programming formulation. These techniques are, however, specific to 
linear programming and cannot be applied in our case. 

The work most similar to ours is that of Drineas et al. [12, 11]. They investi- 
gate the least-squares solution of an overdetermined Unear system, and they prove 
that it is sufficient to keep polynomially many samples to reach low error with 
high probability. They introduce a non-uniform sampling distribution, so that the 
variance of the approximation error is minimized. However, the calculation of the 
probabilities requires a complete sweep through all equations. 

5. Conclusions 

In this paper we have proposed a new algorithm, factored value iteration, for the 
approximate solution of factored Markov decision processes. The classical approx- 
imate value iteration algorithm is modified in two ways. Firstly, the least-squares 
projection operator is substituted with an operator that does not increase max- 
norm, and thus preserves convergence. Secondly, polynomially many samples are 
sampled uniformly from the (exponentially large) state space. This way, the com- 
plexity of our algorithm becomes polynomial in the size of the fMDP description 
length. We prove that the algorithm is convergent and give a bound on the differ- 
ence between our solution and the optimal one. We also analyzed various projec- 
tion operators with respect to their computation complexity and their convergence 
when cionibinecl with approximate value iteration. To our knowledge, this is the 
first algorithm that (1) provably converges in polynomial time and (2) avoids linear 
programming. 
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Appendix A. Proofs 

A.l. Projections in various norms. We wish to know whether wg = argmin^ ||i/w— 
v||p imphes ||i?Wo||oo < l|v||oo for various values of p. Specifically, we are interested 
in the cases when p G {1, 2, oo}. Fig. 1 indicates that the implication does not hold 
for p = 2 or p = oo, only for the case p — I. Below we give a rigorous proof for 
these claims. 

Consider the example v = [1, 1]"^ eR^, H = [1,2]^, w G For these values 
easy calculation shows that |1-H^[wo]l2 jjoo = 6/5 and |li^[wo]L^ ||oo = 4/3, i.e., 
||-f^wo||cx) ^ l|v||(x) for both cases. For p = 1, we shall prove the following lemma: 

Lemma A.l. //wq = argmin^ \\Hw — v||i, then \\Hwo\\oo < l|v||oo- 

Proof. Let z := Hwq e . If there are multiple solutions to the minimization 

task, then consider the (unique) z vector with minimum L2-iiorm. Let r := [[z — vl|i 
and let <S'(v, r) be the Li-sphere with center v and radius r (this is an A^-dimensional 
cross polytope or orthoplex, a generalization of the octahedron). 
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Figure 1. Projections in various norms. The vector v is projected 
onto the image space of H, i.e., the subspace defined by u = Hw. 
Consider the smallest sphere around v (in the corresponding norm) 
that touches the subspace u = Hw (shown in each figure). The 
radius tq of this sphere is the distance of v from the subspace, 
and the tangent point vq (which is not necessarily unique for Li 
projection) is the projection of v. For this point, vq = Hwq holds. 
The shaded region indicates the region {u : ||u||oo < ||v||oo}- To 
ensure the convergence of FVI, the projected vector Vq must fall 
into the shaded region. 

Suppose indirectly that ||z||oo > ||v||oo- Without loss of generality we may 
assume that zi is the coordinate of z with the largest absolute value, and that 
it is positive. Therefore, zi > ||v|joo- Let denote the i^^ coordinate vector 
(1 < i < N), and let — z ~ Sei. For small enough S, S{zs, S) is a cross polytope 
such that (a) S{zs,S) C 5(v,r), (b) Vz' e S{zs,S), \\z'\\oo > Moo, (c) Ve > 
sufficiently small, (1 — e)z g ^(z^,^). The first two statements are trivial. For the 
third statement, note that z is a vertex of the cross polytope S{zs,5). Consider the 
cone whose vertex is z and its edges are the same as the edges of S{zs, 6) joining 
z. It is easy to see that the vector pointing from z to the origo is contained in this 
cone: for each 1 < i < N, \zi\ < zi (as zi is the largest coordinate). Consequently, 
for small enough e, z — ez G 5(z5, (5). 

Fix such an e and let q = (1 — e)z. This vector is (a) contained in the image 
space of H because H[{1 — e)w] = q; (b) ||q — v||i < ||z — v||i = r. The vector 
z was chosen so that it has the smallest Li-norm in the image space oi H, so the 
inequality cannot be sharp, i.e., ||q — v||i = r. However, ||q||2 = (1 ^ £)NI|2 < ||z||2 
with strict inequality, which contradicts our assumption about z, thus completing 
the proof. □ 

A. 2. Probabilistic interpretation of N{H^). 

Definition A. 2. The basis functions {hk}^'Li have the uniform covering (UC) 
property, if all row sums in the corresponding H matrix are identical: 

rib 

^x.fc = B for all x G X, 

and all entries are nonnegative. Without loss of generality we may assume that all 
rows sum up to 1, i.e., H is a stochastic matrix. 
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We shall introduce an auxiliary MDP such that exact value iteration in 
is identical to the approximate value iteration in the original MDP M. Let S be 
an i^-element state space with states si, . . . ,s_r:- A state s is considered a discrete 
observation of the true state of the system, x e X. 

The action space A and the discount factor 7 are identical to the corresponding 
items of M., and an arbitrary element sq € S is selected as initial state. In order to 
obtain the transition probabilities, let us consider how one can get from observing 
s to observing s' in the next time step: from observation s, we can infer the hidden 
state X of the system; in state x, the agent makes action a and transfers to state 
x' according to the original MDP; after that, we can infer the probability that 
our observation will be s', given the hidden state x'. Consequently, the transition 
probability P(s' | s, a) can be defined as the total probability of all such paths: 

P(s' I s, a) := Pr(x | s) Pr(x' | x, a) Pr(s' | x). 

x,x'ex 

Here the middle term is just the transition probability in the original MDP, the 
rightmost term is -ffx,s) and the leftmost term can be rewritten using Bayes' law 
(assuming a uniform prior on x): 

p. I . ^ Pr(s|x)Pr(x) ^ gx.s-^ ^ H^^, 

^ ' ' Ex"6xPr(s|x")Pr(x") Ex"ex ^^x",s • ^ Ex-ex^^x-.s' 

Consequently, 

P(s'|s,a)= ^ P(x-|x,a)gx,,= [^(gfP"g]^^,. 

xxex-^x'-ex-^x'-.s 

The rewards can be defined similarly: 

P(s,a) := Y I s)P(x,a) = [N{Hfv'']^. 
xex 

It is easy to see that approximate value iteration in M. corresponds to exact value 

iteration in the auxiliary MDP M. 

A. 3. The proof of the sampling theorem (theorem 3.1). First we prove a 
useful lemma about approximating the product of two large matrices. Let A e 

K"^" and P G M"^'' and let C = A • P. Suppose that we sample columns of A 
uniformly at random (with repetition), and we also select the corresponding rows of 

B. Denote the resulting matrices with A and B. We will show that A-B k. c- A- B, 
where c is a constant scaling factor compensating for the dimension decrease of the 
sampled matrices. The following lemma is similar to Lemma 11 of [11], but here 
we estimate the infinity-norm instead of the i2-norm. 

Lemma A.3. Let A e W^''^, B e ]R^^'= and C = A - B. Let N' be an integer 
so that I < N' < N , and for each i G {1, . . . , N'}, let ri be an uniformly random 
integer from the interval [1,A^]. Let A G M™^^ be the matrix whose i*'' column 
is the ri*^ column of A, and denote by B the N' x k matrix that is obtained by 
sampling the rows of B similarly. Furthermore, let 
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Then, foranye,S> 0, ||C-C||oo < e^PII 
if the sample size satisfies N' > log ^ 



I -B"^ I loo with probability at least 1 — 6, 



Proof. We begin by bounding individual elements of the matrix C: consider the 
element 



N 



N' 



Let Cpq be the discrete probability distribution determined by mass points {Ap^i ■ 
Bi,q I 1 < i < N}. Note that Cpq is essentially the sum of N' random variables 
drawn uniformly from distribution Cpq. Clearly, 

\ApiBiq\ < max|^jj|max|Bjj| < max max = ||A||oo||B||<x>, 



SO we can apply HocfFding's inequality to obtain 



Pr 



N 



N' 



N 



> ei^ < 2exp(^ 



N'el 



or equivalently, 



Pr 



-[AB]pq-[AB], 



> iVei) < 2exp(^- 



N'el \ 



2\\A\\UB\\l,- 



where ei > is a constant to be determined later. Prom this, we can bound the 
row sums of C — C: 



p=i 



> m- Nei^ < 2mexp^- 



-)■ 



which gives a bound on \\C — C\\oo- This is the maximum of these row sums: 



Pr 



{\\c-c\u 



> mNei 



)=Pr(, 



max ^ I Cp 

' p=i 



> mNei^ < 2fcmexp^- 



N'ej 



Therefore, by substituting ei = e\\A\\oc\\B\\oc/ni, the statement of the lemma is 
satisfied if 2A:mexp(-f^) < 5, i.e, if iV' > ^ log □ 



If both A and B are structured, we can sharpen the lemma to give a much better 
(potentially exponentially bettor) bound. For this, we need the following definition: 

For any index set Z, a matrix A is called Z-local-scope matrix, if each column 
of A represents a local-scope function with scope Z. 

Lemma A. 4. Let A^ and B be local-scope matrices with scopes Z\ and Zi, and let 
Nq = nl^il+l^^l, and apply the random row/column selection procedure of the pre- 
vious lemma. Then, for any e,S > 0, \\C — C\\oo < e-^o||^||oo||-B||(x) with probability 
at least 1 — 6, if the sample size satisfies N' > log 



, 2km 



Proof. Fix a variable assignment x[Zi U Z2] on the domain Zi U Z2 and consider 
the rows of A that correspond to a variable assignment compatible to x[Zi U Z2], 
i.e., they are identical to it for components Zi U Z2 and are arbitrary on 

W:={l,2,...,m}\{Z,UZ2). 
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It is easy to see that all of these rows are identical because of the local-scope 
property. The same holds for the columns of B. All the equivalence classes of 
rows/columns have cardinality 

Ni := nl^l = N/No. 

Now let us define the mx Nq matrix A' so that only one column is kept from each 
equivalence class, and define the Nq x k matrix B' similarly, by omitting rows. 
Clearly, 

A-B = NiA' ■ B', 

and we can apply the sampling lemma to the smaller matrices A' and B' to get 
that for any e, ^ > and sample size N' > log with probability at least 
1-S, 

\\A^' - A' . S'lU < eNo\\A'\U\B'\U 
Exploiting the fact that the max-norm of a matrix is the max;imum of row norms, 
ll^'lloo = imicc/^1 and ||.B'||(X) = H-BHoo, we can multiply both sides to get 

WN^A^' - A ■ B\\^ < eNoN4A\\^/Ni\\B\\^ = eNo\\A\\^\\B\\^, 
which is the statement of the lemma. □ 

Note that if the scopes Zi and Z2 are small, then the gain compared to the 

previous lemma can be exponential. 

Lemma A. 5. Let A = Ai + . . . + Ap and B = Bi + . . . + Bq where all Ai and Bj 
are local-scope matrices with domain size at most z, and let Nq = n^. If we apply 

the random row/column selection procedure, then for any e,5 > 0, \\C — C\\co < 
eiVopgmaxj ||Ai||oomaxj ||i?j||oo with probability at least 1 — 6, if the sample size 
satisfies N' >^log^. 

Proof. 

\\c - c\\oo < EE ii^^Tsj- - • ^^n- 
i=i j=i 

For each individual product term we can apply the previous lemma. Note that we 
can use the same row/column samples for each product, because independence is 
required only within single matrix pairs. Summing the right-hand sides gives the 
statement of the lemma. □ 

Now we can complete the proof of Theorem 3.1: 

Proof. 

||w*-w'|U = IIGTFw* -GTi/w'lloo 

< \\GTHw* - GTHw*\\oo + \\GTHw* - GTHw'\\^ 

< \\GTHw* -GTHw*\\oc. + l\\^* -^'Woc., 

i.e., ||w* — w'lloo < j^\\GTHw* — GTHw*\\oo- Let ttq be the greedy policy 
with respect to the value function Hw*. With its help, we can rewrite THw* as 
a linear expression: THw* = r'^" + jP'^" Hw* . Furthermore, T is a component- 
wise operator, so we can express its effect on the downsampled value function as 
THw* =r^°+ -yP^^w*. Consequently, 

\\GTHw* -GTHw*\\oo < \\Gr'"> - Gr^°\\^ + -f\\GP''°H - GP^\\oo\\^*\\oo 
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Applying the previous lemma two times, we get that with probability greater than 
1 - 5i, IIGr'^" - Gr"°||cx> < eiCi if TV' > ^ log ^ and with probability greater 

than 1 - (52, \\GP''°H - GP^IU < £2(^2 ii N' > ^ log where Ci and 
C2 are constants depending polynomially on A^o and the norm of the component 
local-scope functions, but independent of N. 

Using the notation M = j^(^C'i + 7C2||w*||oo), £1=^2 = e/Af, Si ^ 62 = S/2 
and S = proves the theorem. □ 

Informally, this theorem tells that the required number of samples grows quadrat- 

ically with the desired accuracy 1/e and logarithmically with the required certainty 
l/S, furthermore, the dependence on the number of variables m is slightly worse 
than quadratic. This means that even if the number of equations is exponentially 
large, i.e.. = 0(e™). wc can select a polynomially large random subset of the 
equations so that with high probability, the solution does not change very much. 
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